Abstract

Re-identification algorithms are used in data privacy to measure dis- 
closure risk. They model the situation in which an adversary attacks a 
published database by means of linking the information of this adversary 
with the database. 

In this paper we formalize this type of algorithm in terms of true 
probabilities and compatible belief functions. The purpose of this work is 
to leave aside as re-identification algorithms those algorithms that do not 
satisfy a minimum requirement. 



1 Introduction 

Privacy preserving data mining (PPDM) and statistical disclosure control (SDC) 
are two active areas of research that study how to avoid disclosure of sensitive 
information when data is released to third parties for their analysis. 

One of the existing approaches for ensuring privacy consists of manipulating 
the datafile adding some noise or reducing the quality of the information. Several 
data protection methods have been developed in this direction. Noise addition, 
microaggregation, rank swapping and PRAM are some of the existing methods. 
In general, these methods consist of transforming a data file X by means of a
masking method p into a new data file Y. That is, the masking method returns
Y := p(X).

These methods reduce the disclosure risk at the expense of some information loss.
In other words, an analysis of Y will in general give different results than the
same analysis of X. With the aim of quantifying this loss, several
information loss measures have been defined in the literature.

Nevertheless, although the function p inflicts a perturbation in the data
that causes some information to be lost, the modification of the data may be
insufficient for ensuring privacy. Due to this, disclosure risk measures have been
defined and studied in the literature.

Disclosure risk measures can be defined in terms of re-identification. This 
corresponds to identity disclosure. Re-identification algorithms permit us to 
model the situation in which an adversary wants to attack a published data set 
using some information that he has available. The adversary tries to link his 
information expressed as records in a datafile with the records in the published 
data set. The more records he reidentifies, the larger the risk. Therefore, given 
a particular file, the proportion of reidentified records is a measure of the risk. 

The concept of re-identification is also the cornerstone of the theory of
k-anonymity. A dataset is k-anonymous if for each record in the dataset, there are
k − 1 other records that are equal to it. Nevertheless, as pointed out in [13], the
important question here is not whether the records have the same or different
values, but that the records are indistinguishable in the re-identification process
(when the adversary attacks the dataset). This idea permitted us in [13] to
define n-confusion as an alternative to k-anonymity which provides the same
level of anonymity without requiring records to have the same values.

Because of that, re-identification algorithms are fundamental in data privacy 
and the literature presents several algorithms for re-identification [Tni HSl [H] . 
The literature also discusses some models [1] for re-identification that are
used to determine the parameters of the algorithms. Nevertheless, to our
knowledge, there is no approach for formalizing and determining the correctness
of re-identification algorithms. That is, there is no discussion on what a proper
and correct re-identification algorithm is, and what kind of result a correct
re-identification algorithm should give.

In this paper we present a formalization of re-identification in terms of belief 
functions and true probabilities. 

The basic idea is that a good re-identification algorithm, given some
information, returns a probability distribution over a population. If we assume
that this re-identification algorithm behaves correctly, then it cannot return
an arbitrary probability distribution but must return a distribution that is
compatible with the true one. In addition, we would expect that the more
information we have available, the more the probability returned by the
algorithm should resemble the true one.

In this paper we model this situation in terms of belief functions and
the transferable belief model [12]. Starting from a true probability, we define
two types of re-identification algorithms. First, we define a re-identification
algorithm as one that returns a belief function that is compatible [5] with the
true probability, and later as one that returns the pignistic transformation of a
belief function that is compatible with the true probability.

The structure of the paper is as follows. In Section 2 we review some concepts
that are needed later in this work. In particular, we discuss belief functions and
re-identification algorithms. In Section 4 we introduce our model and discuss
some relevant results about it. The paper finishes with some conclusions.

2 Preliminaries



This section is divided into two parts. In the first part we review some concepts
related to belief functions, a model for approximate reasoning. We will use belief
functions to construct a model for record linkage. In the second part we review
data protection methods and k-anonymity, one of the approaches in data masking.

2.1 Belief functions 

Belief functions can be used to represent uncertainty with respect to probability 
distributions. We will not go into the details of their justification. The descrip- 
tion in this section focuses on the concepts we need in the rest of the paper. For 
details and additional discussion see e.g. pi [T^ [T7] . 

Definition 1. A set function $Bel : 2^X \to [0, 1]$ is a belief function if and only
if it satisfies

(i) $Bel(\emptyset) = 0$, $Bel(X) = 1$ (boundary conditions)

(ii) $A \subseteq B$ implies $Bel(A) \le Bel(B)$ (monotonicity)

(iii) For all $A_1, \ldots, A_n \subseteq X$,

\[ Bel(A_1 \cup \ldots \cup A_n) \ge \sum_{j} Bel(A_j) - \sum_{j<k} Bel(A_j \cap A_k) + \ldots + (-1)^{n+1} Bel(A_1 \cap \ldots \cap A_n). \qquad (1) \]

Belief functions are closely related to basic probability assignments. There 
is a basic probability assignment for each belief function, and a belief function 
for each basic probability assignment. 

Definition 2. A function $m : 2^X \to [0, 1]$ is a basic probability assignment if
and only if

(i) $m(\emptyset) = 0$

(ii) $\sum_{A \subseteq X} m(A) = 1$

There exist two names for this function in the literature: basic probability 
assignment (e.g. in [TT]) and basic belief assignment (e.g. in jH]). In the rest 
of the paper we will say just assignment. 

The following proposition establishes the relationship between assignments 
and belief functions. 

Proposition 3. Let Bel be a belief function defined on the reference set X,
then the function $m_{Bel}$ defined below is a basic probability assignment:

\[ m_{Bel}(A) = \sum_{B \subseteq A} (-1)^{|A| - |B|} Bel(B) \quad \text{for all } A \subseteq X. \qquad (2) \]

Let m be a basic probability assignment, then the function $Bel_m$ defined below
is a belief function:

\[ Bel_m(A) := \sum_{B \subseteq A} m(B) \quad \text{for all } A \subseteq X. \qquad (3) \]



Belief functions generalize probabilities. In particular, when $m(A) = 0$ for
all A such that $|A| > 1$, then Bel is a probability. In this case, the assignment
m to the singletons is the probability distribution. That is, $P(\{x\}) = m(\{x\})$
for all $x \in X$, and then $P(A) = Bel(A)$ for all $A \subseteq X$.
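The correspondence between assignments and belief functions in Equations (2) and (3) can be checked numerically on a small reference set. The following is a minimal sketch, not part of the original formalization; the three-element set and the masses in `m` are hypothetical values chosen only for illustration.

```python
from itertools import combinations

def powerset(universe):
    """Return every subset of `universe` as a frozenset."""
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def belief_from_mass(m, universe):
    """Equation (3): Bel(A) is the sum of m(B) over all B contained in A."""
    return {A: sum(v for B, v in m.items() if B <= A)
            for A in powerset(universe)}

def mass_from_belief(bel, universe):
    """Equation (2): Moebius inversion recovering m from Bel."""
    return {A: sum((-1) ** (len(A) - len(B)) * bel[B] for B in powerset(A))
            for A in powerset(universe)}

# Hypothetical assignment on a three-element reference set.
X = {"x1", "x2", "x3"}
m = {frozenset({"x1"}): 0.5, frozenset({"x2", "x3"}): 0.3, frozenset(X): 0.2}

bel = belief_from_mass(m, X)
recovered = mass_from_belief(bel, X)
```

Under these assumptions, `recovered` agrees with `m` on every subset up to floating-point error, which illustrates that each of the two functions determines the other.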

As stated before, belief functions can represent uncertainty in probability
distributions. They make it possible to distinguish situations which standard
probabilities cannot. For example, total ignorance in a set X is modeled by defining
$m(X) = 1$ and $m(A) = 0$ for all $A \ne X$. In contrast, when we know that the
elements in X all have the same support we assign $m(\{x\}) = 1/|X|$ for all $x \in X$.
Note that this is different from the case of standard probabilities where both
situations are represented by $P(x) = 1/|X|$ for $x \in X$.

Given a belief function Bel defined from m, Dempster defined the pignistic 
transformation as a function that finds a probability distribution from Bel. This 
pignistic transformation is based on the transferable belief model by Smets [12] 
that distinguishes between the credal and the pignistic level. The credal level 
is where beliefs are taken into consideration and operated on, and the pignistic 
level is where beliefs are used. Although we do not understand probabilities 
and beliefs as subjective, as Smets does, both levels are appropriate for mod- 
eling re-identification. An ideal re-identification algorithm will compute belief 
functions in the credal level, with the minimal possible commitments in case 
of uncertainty. Then, when decisions are to be made, we move to the pignistic 
level and probabilities are made concrete. 

Definition 4. Let Bel be a belief function with assignment m, then we define the
pignistic probability distribution derived from Bel, $P_{Bel}$, as:

\[ P_{Bel}(x) = \sum_{\{B \in 2^X : x \in B\}} \frac{m(B)}{|B|}. \]

2.2 Data protection methods

Formally, given a data set X, a masking method p constructs a data set Y :=
p(X). Data privacy studies masking methods that return datasets which can be
released to third parties in a way that avoids disclosure of sensitive information,
but preserves the value of the data as material for analysis.

One of the existing concepts for data privacy is k-anonymity. A dataset
satisfies k-anonymity when for each record there are k − 1 other records that
are indistinguishable from it in the dataset.

Several algorithms have been proposed in the literature to build a dataset
compliant with k-anonymity through generalization, suppression and clustering.
For example, if we have 6 records with values 18, 16, 19, 22, 24, 24 for attribute
$V_1$, we can consider the intervals [15, 19], [20, 25] and then recode the original
values according to these intervals. Doing so, we will have three records in the
interval [15, 19] and three others in the interval [20, 25]. This ensures
k-anonymity for k = 3 if only this attribute is considered.

We will use $gen_{V_i}$ to denote a method that ensures k-anonymity for a single
attribute $V_i$ using generalization for some appropriate value k.
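The generalization example with six values can be reproduced in a few lines. This is a minimal sketch under the assumption that a generalization method simply recodes each value as the interval containing it; the helper names `generalize` and `anonymity_level` are ours and do not come from the literature.

```python
from collections import Counter

def generalize(value, intervals):
    """Recode a numeric value as the (low, high) interval that contains it."""
    for low, high in intervals:
        if low <= value <= high:
            return (low, high)
    raise ValueError(f"{value} is not covered by any interval")

def anonymity_level(values):
    """Largest k such that every distinct value occurs at least k times."""
    return min(Counter(values).values())

ages = [18, 16, 19, 22, 24, 24]        # values of the single attribute
intervals = [(15, 19), (20, 25)]       # generalization hierarchy from the text

masked = [generalize(v, intervals) for v in ages]
k_before = anonymity_level(ages)       # level of the original values
k_after = anonymity_level(masked)      # level after generalization
```

With these values, each interval receives three records, so the masked attribute is 3-anonymous while the original attribute is only 1-anonymous.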

3 Re-identification algorithms

Given a dataset X, a protection method p, and the protected dataset Y := p(X),
disclosure risk can be measured in terms of the number of records in Y that
can be correctly reidentified. Indeed, a common approach when constructing
re-identification algorithms is to optimize with respect to this criterion [1].
Nevertheless, although formalizations of the expected outcome of these algorithms
exist (i.e., we expect a method to maximize the number of correct links), no
formalization exists of what we mean when we say that a re-identification method
is correct. We would like a formalization that excludes re-identification methods
which perform incorrect re-identifications. In this paper, we discuss a
formalization based on belief functions and a true probability. We will base our
discussion on a previous definition of re-identification algorithms used to define
n-confusion in [13]. The definition is as follows.

Definition 5. [13] Let p be a method for anonymization of databases, X a
table with n records indexed by I in the space of tables D and Y := p(X) the
anonymization of X using p. Then a re-identification method is a function that,
given a collection of entries y in $\mathcal{P}(Y)$ and some additional information from
a space of auxiliary informations A, returns the probability that y are entries
from the record with index $i \in I$,

\[ r : \mathcal{P}(Y) \times A \to [0,1]^n, \qquad (y, a) \mapsto (P(y \text{ corresponds to record } X[i]) : i \in I). \]

Consider the objective probability distribution corresponding to the
re-identification problem. Then, we require from a re-identification method that it
returns a probability distribution that is compatible with this probability, even
when some relevant information is missing. Compatibility can be modeled in terms
of compatibility of belief functions (see [5]).

Section 4 discusses re-identification algorithms and the compatibility issue
mentioned above. Before that, we review some of the approaches that can be found
in the literature on record linkage. Recall that we have defined record linkage
in terms of the probability that y, the protected record, are entries from the
record with index $i \in I$, and we denote this by $r(y, a)[i]$.

As we will see, for some methods we can understand probabilities as following
a Bayesian objective approach, and for other methods as subjective probabilities.
In the latter case, we can understand the probabilities as votes.

Some of the methods described below are not completely formalized in the
literature, so the formulation is ours.

• k-anonymity and re-identification. Re-identification methods applied
to k-anonymous databases return for each y a list of records $J \subseteq I$
that are possible matches of the given record.

On the basis of the principle of insufficient reason (or principle of
indifference) this can be modeled by means of a uniform distribution over the
records in J. That is, $r(y, a)[i] = 1/|J|$ for all $i \in J$.

An alternative model, to our knowledge not previously considered in the
literature, is to use belief functions. This will be the subject of this
article, and Example 11 focuses on the use of belief functions in
k-anonymity.

• Probabilistic record linkage. The mathematical model formalized by
Fellegi and Sunter in 1969 [S] is based on a probabilistic model that
computes the probability of a particular coincidence pattern $\gamma$ conditioned
on the existence of a match: $P(\gamma | Match)$. Probabilistic record linkage
returns the probability of a correct match given a particular coincidence
pattern (i.e., $P(Match | \gamma)$). Bayes' rule is used in this process. This
situation can be modeled by

\[ r(y, a)[i] = P(Match | \gamma(y, x_i)). \]

• Specific attacks to data protection methods. The approach to attack
rank swapping in [9] can be represented in terms of a list of candidates,
in line with the re-identification methods for k-anonymity described
above. Re-identification attacks for rank swapping p-buckets can be
modeled by means of probabilities.

• Distance-based record linkage. Some literature exists where re-identification
methods assign to each record in one file the most similar record (at a
minimum distance) in the other file. In this case, probabilities can be
defined from the distances, but such assignments should typically be
interpreted only as votes or indications of subjective probabilities. An
exception is [3], where a real probability is estimated taking into account
the similarity between records.
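The use of Bayes' rule in probabilistic record linkage, mentioned in the list above, can be illustrated with a short computation. This is only a sketch: the likelihood of the pattern among non-matches and the prior probability of a match are quantities the text does not specify, and all numeric values below are hypothetical.

```python
def posterior_match(p_gamma_given_match, p_gamma_given_unmatch, prior_match):
    """Bayes' rule: P(Match | gamma) from P(gamma | Match), P(gamma | Unmatch)
    and the prior P(Match)."""
    numerator = p_gamma_given_match * prior_match
    evidence = numerator + p_gamma_given_unmatch * (1.0 - prior_match)
    return numerator / evidence

# Hypothetical values: the pattern is common among matches, rare among
# non-matches, and a random pair of records is a match with probability 0.01.
rare_prior = posterior_match(0.9, 0.1, 0.01)

# Same likelihoods with an uninformative prior over match / non-match.
even_prior = posterior_match(0.9, 0.1, 0.5)
```

The comparison between `rare_prior` and `even_prior` shows how strongly the returned value of $r(y, a)[i]$ depends on the prior, which is one reason the parameters of these algorithms need to be estimated with care.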

4 A formalization of re-identification 

In this section we analyse the concept of re-identification further. In
Definition 5, re-identification is a function that, given some partial information
on Y and some additional information, returns a probability distribution on the
set of records.

We claim that this probability distribution should be compatible with the
true probability. Our motivation is to create a theoretical foundation for
disclosure risk evaluation. The formalization will leave aside those
re-identification algorithms that do not satisfy some minimum requirements. In
particular, we will not approve algorithms that deliver incorrect results on
purpose, and we will force algorithms to perform as well as possible, according to
the evidence found in the data and any a priori knowledge. For the purpose of risk
evaluation, using the worst case scenario, this implies no loss of generality.

In this section, we introduce a formalization that relies on a true probability
of re-identification. This true probability corresponds to the case in which we
know everything about the whole anonymization process and this information is
used in the re-identification process. That is, the true probability only
includes the uncertainty that cannot be removed because of, e.g., randomness.

We would expect that the re-identification process leads to a probability that 
is less informative than the true probability in case of uncertainty, e.g. on the 
masking process or on the data available for re-identification. Examples of such 
uncertainty could be that some variables are not included in the risk analysis, 
or that part of the masking process is not disclosed and cannot be taken into 
account in the risk analysis. 

Nevertheless, uncertainty does not justify all probability distributions. Only
some of them are valid. As an extreme example, we cannot accept as a
re-identification method one that assigns $r(y, a)[i] = 1$ if and only if $i = i_0$
for any $y \in Y$. In order to represent less informative probabilities we use
imprecise probabilities and, more specifically, belief functions. As stated in
Section 2.1, belief functions can be used when there is uncertainty in the values
of a probability distribution. When no additional uncertainty is present in the
re-identification process, the corresponding belief function is equivalent to a
probability distribution.

We will presume that an ideal re-identification method is one that
expresses uncertainty by means of a belief function. The belief function computed
by this re-identification method should be compatible with the true probability.

Here we use the term compatible according to Chateauneuf [2], who defined it
for belief functions. Definition 1 in [5] defines two belief functions as compatible
when the joint information is non-empty. The definition which we will use here
is the same as Chateauneuf's definition except for the fact that we will compare
a probability (the true one) and a belief function.

We will use P to denote the true probability of re-identification. We give its 
formal definition below. 

Definition 6. Let X be a dataset, p a data masking method, and Y := p(X).
Then, we define the true probability $P_{p,X,Y}(x_i | y_i)$ as the probability that the
protected record $y_i$ proceeds from the record $x_i$ given p, X, and Y.

Given a true probability and a belief function, we define their compatibility 
as follows. 

Definition 7. Given a probability P, we say that a belief function Bel is
compatible with P if $P \ge Bel$, that is, if $P(A) \ge Bel(A)$ for all $A \subseteq X$.
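The compatibility condition of Definition 7 can be tested by brute force on a small reference set by comparing P and Bel on every subset. The sketch below is illustrative only; the three records and the three candidate assignments (`vacuous`, `cautious`, `wrong`) are hypothetical.

```python
from itertools import combinations

def subsets(universe):
    """Yield every subset of `universe` as a frozenset."""
    items = sorted(universe)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)

def belief(m, A):
    """Bel(A): total mass of the focal sets contained in A."""
    return sum(v for B, v in m.items() if B <= A)

def is_compatible(P, m, universe, tol=1e-12):
    """Definition 7: Bel is compatible with P when P(A) >= Bel(A) for all A."""
    return all(sum(P.get(x, 0.0) for x in A) + tol >= belief(m, A)
               for A in subsets(universe))

X = {"a", "b", "c"}
P = {"a": 0.5, "b": 0.5, "c": 0.0}          # hypothetical true probability

vacuous = {frozenset(X): 1.0}               # total ignorance
cautious = {frozenset({"a", "b"}): 1.0}     # commits only to the set {a, b}
wrong = {frozenset({"a"}): 1.0}             # claims certainty about record a

results = {name: is_compatible(P, m, X)
           for name, m in [("vacuous", vacuous),
                           ("cautious", cautious),
                           ("wrong", wrong)]}
```

Under these assumptions the vacuous and the cautious assignments are compatible with P, while the assignment that commits fully to a single record is not, because it claims more support for {a} than the true probability gives.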

For the sake of illustration, let us consider the following example with partial
information in the re-identification process. This will be a running example of
this paper.

Example 8. Let X be a dataset with different attributes $V_1, \ldots, V_m$. Consider
the masking method p where each attribute $V_i$ is protected by means of a
generalization method $gen_{V_i}$ which ensures k-anonymity for $k = k_i$. That is, given
that $X[V_i]$ represents the column of X with attribute $V_i$, we have that $gen_{V_i}$ is
applied to $X[V_i]$ for $i \in \{1, \ldots, m\}$, and Y is defined in terms of the results
of $gen_{V_i}$ putting their results side by side as follows:

\[ Y := p(X) = [gen_{V_1}(X[V_1]) \,\|\, \ldots \,\|\, gen_{V_m}(X[V_m])]. \]

The true probability $P_{p,X,Y}$ for the re-identification of X and Y := p(X)
for a given record $y \in Y$ assigns the same non-zero probability to all records
x in X such that y can proceed from x (taking into account the generalization
processes $gen_{V_i}$), and assigns 0 to all other records. Formally, let the
record y be $y = (y_1, \ldots, y_m)$ and let us define the candidate set of y as the
records $x \in X$ such that y can proceed from x (i.e., $CandidateSet(y) = \{x \mid y =
(gen_{V_1}(x_1), \ldots, gen_{V_m}(x_m))\}$). Then, the true probability of x given y is
defined by:

\[ P_{p,X,Y}(x | y) = \begin{cases} \frac{1}{|CandidateSet(y)|} & \text{if } x \in CandidateSet(y) \\ 0 & \text{otherwise.} \end{cases} \]

Re-identification methods that are applied to subsets of Y consisting of only
some attributes will lead to probability distributions that may be different from
the true probability distribution. If this is the case, then they will be less
informative. Note that, when only a subset of attributes $V' \subseteq \{V_1, \ldots, V_m\}$
is considered, then the re-identification algorithm may select more candidates
than there are in the true candidate set.

In the next example we consider the re-identification of the ith record of Y
taking into account only partial knowledge consisting of some of the attributes.

Example 9. Let X, p, Y := p(X), $y \in Y$ and $gen_{V_i}$ be defined as in Example 8.
Let $Attrs(j)$ for $j = 1, \ldots, 2^m - 1$ represent all possible (non-empty) subsets of
attributes of $V = \{V_1, \ldots, V_m\}$ indexed by j. Let $y_j \in \mathcal{P}(Y)$ for
$j = 1, \ldots, 2^m - 1$ represent a record of the database Y restricted to $Attrs(j)$.
An example of an indexation of attribute subsets is $Attrs(1) = \{V_1\}$ and
$Attrs(3) = \{V_1, V_2\}$.

Then, we expect a re-identification method applied to $y_j$, and taking into
account how Y is generated from X using p, to deliver the following probability
distribution:

\[ r(y_j, a)[i] = \begin{cases} \frac{1}{|CandidateSet_{Attrs(j)}(y)|} & \text{if } x_i \in CandidateSet_{Attrs(j)}(y) \\ 0 & \text{otherwise,} \end{cases} \]

where $CandidateSet_A(y_j)$ includes a record $x \in X$ if y can proceed from x
when only the attributes in A are considered.

It is easy to see that for any record y in Y, we have

\[ CandidateSet(y) \subseteq CandidateSet_{Attrs(j)}(y) \]

and therefore $P_{p,X,Y}(x_i | y) \ge r(y_j, a)[i]$ for all $x_i$ in $CandidateSet(y)$.
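The candidate sets of Examples 8 and 9 can be computed directly on a toy table. The following sketch makes several assumptions that are not in the text: the table `X`, the two attributes, and the generalization functions `gen_age` and `gen_zip` are all hypothetical and are chosen only to make the set inclusion visible.

```python
def gen_age(v):
    """Hypothetical generalization of an age into one of two intervals."""
    return "15-19" if v <= 19 else "20-25"

def gen_zip(v):
    """Hypothetical generalization of a code keeping only its first two digits."""
    return v[:2] + "**"

GENERALIZERS = {"age": gen_age, "zip": gen_zip}

# Toy original table X (hypothetical data).
X = [
    {"age": 18, "zip": "0811"},
    {"age": 16, "zip": "0823"},
    {"age": 19, "zip": "1702"},
    {"age": 22, "zip": "0815"},
]

def mask(record):
    """Apply every attribute-wise generalization, as the method p of Example 8."""
    return {a: g(record[a]) for a, g in GENERALIZERS.items()}

def candidate_set(y, attrs):
    """Indices of the records in X from which y can proceed using only `attrs`."""
    return {i for i, x in enumerate(X)
            if all(GENERALIZERS[a](x[a]) == y[a] for a in attrs)}

def uniform_over(candidates):
    """Uniform distribution over the candidates, zero elsewhere."""
    return [1.0 / len(candidates) if i in candidates else 0.0
            for i in range(len(X))]

y = mask(X[0])
full = candidate_set(y, ["age", "zip"])   # CandidateSet(y)
partial = candidate_set(y, ["age"])       # CandidateSet restricted to {age}
true_p = uniform_over(full)
partial_p = uniform_over(partial)
```

In this toy table the third record is a candidate when only the age is used but not when both attributes are used, so it receives positive probability from the partial method although its true probability is zero.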

In this example the re-identification method assigns to all records in the
candidate set the same probability. This is the usual way to assign probabilities
under the principle of indifference. Nevertheless, in reality we only know that
the true match is one of the records in the candidate set and that we do not
have any preference among them. If we instead allow the re-identification method
to return a belief function, then this situation can be properly expressed. We now
define the concept of a re-identification method expressing uncertainty, a
re-identification method that is not required to assign probabilities to singletons.

Definition 10. Let p be a method for anonymization of databases, X a
table with n records indexed by I in the space of tables D and Y := p(X) the
anonymization of X using p. Let $P_{p,X,Y}(x_i | y)$ be the true probability of p, X
and Y. Then a re-identification method expressing uncertainty is a function
that, given a collection of entries y in $\mathcal{P}(Y)$ and some additional information
from a space of auxiliary informations A, returns the belief function compatible
with the true probability $P_{p,X,Y}(x_i | y)$ that y are entries from the record with
index $i \in I$,

\[ r^* : \mathcal{P}(Y) \times A \to [0,1]^{2^n}, \qquad (y, a) \mapsto (m(y \text{ proceeds from a record in } B) : B \subseteq X). \]

As for any belief function, in this definition we expect

(1) $m(X) = 1$ and $m(A) = 0$ for all $A \ne X$ when there is no evidence on which
are the original records corresponding to the protected record y, and

(2) an increment of the belief function for $B \subseteq X$ when the evidence increases
for records in B.

In addition, the same belief functions will apply to different protected records
whenever these have the same values. Formally, we have that $Bel(y, a) =
Bel(y', a)$ if $y = y'$ holds.

The following example illustrates the use of a re-identification method
expressing uncertainty.

Example 11. Let X, p, Y := p(X), $y \in Y$, $y_j$, $gen_{V_i}$, CandidateSet and
$CandidateSet_A$ be defined as in Examples 8 and 9. Then, the belief function
$r^*(y_j, a)$ that best represents the uncertainty is defined by the following
assignment:

\[ m(A) = \begin{cases} 1 & \text{if } A = CandidateSet_{Attrs(j)}(y) \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, for all $B \subseteq X$,

\[ r^*(y_j, a)(B) = \sum_{A \subseteq B} m(A). \]

It is easy to see that $r^*(y_j, a)(B) = 1$ if and only if
$CandidateSet_{Attrs(j)}(y) \subseteq B$.

For the belief functions in this example we can prove the following.

Proposition 12. The belief functions $r^*(y_j, a)$ defined in Example 11 are
compatible with the true probability in Example 8.

Proof. For simplicity, let us use the notation $Bel(B) = r^*(y_j, a)(B)$. We need
to prove that $P(C) \ge Bel(C)$ for all $C \subseteq X$. Since $Bel(B) \in \{0, 1\}$, we only
need to check two cases.

• When $Bel(C) = 0$, it is clear that $P(C) \ge Bel(C)$ for all C.

• When $Bel(C) = 1$, then $C \supseteq CandidateSet_{Attrs(j)}(y) \supseteq CandidateSet(y)$.
Therefore, $P(C) \ge P(CandidateSet_{Attrs(j)}(y)) \ge P(CandidateSet(y)) = 1$.
So, $P(C) = Bel(C)$.

In the case of $Bel(C) = 1$ we use the condition discussed above that

\[ CandidateSet(y) \subseteq CandidateSet_{Attrs(j)}(y). \]

This proves the proposition. □
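The argument of Proposition 12 can be checked exhaustively on a small instance. The sketch below assumes a hypothetical population of four records in which the true candidate set has two records and the candidate set for a partial attribute subset has three; it also tests an assignment that drops one true candidate, for contrast.

```python
from itertools import combinations

universe = [0, 1, 2, 3]             # hypothetical record indices
candidate_set = {0, 1}              # CandidateSet(y), all attributes used
partial_candidate_set = {0, 1, 2}   # CandidateSet for a partial attribute subset

def true_prob(C):
    """True probability of a set C: uniform over the true candidate set."""
    return len(set(C) & candidate_set) / len(candidate_set)

def bel(C, focal):
    """Belief induced by m(focal) = 1: equals 1 iff focal is contained in C."""
    return 1.0 if focal <= set(C) else 0.0

def compatible(focal):
    """Check P(C) >= Bel(C) on every subset C of the universe."""
    return all(true_prob(C) >= bel(C, focal)
               for r in range(len(universe) + 1)
               for C in combinations(universe, r))

superset_ok = compatible(partial_candidate_set)
missing_one = compatible({1, 2})    # focal set that leaves out record 0
```

Placing all the mass on a superset of the true candidate set passes the check, whereas the focal set {1, 2}, which omits a true candidate, fails it because the true probability of {1, 2} is only 1/2.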

In contrast, if the re-identification method assigns $m(B) = 1$ to a set B that
misses one record $x_i$ of the candidate set of y, then the inferred belief function is
not compatible with the true probability. This is formalized in the next lemma.

Lemma 13. Let $x_0$ be a record of the candidate set, let B be an arbitrary subset
of X, and let

\[ C_0 = (B \cup CandidateSet(y)) \setminus \{x_0\}. \]

Let a re-identification method assign $m(C_0) = 1$ and $m(C) = 0$ for all $C \subseteq X$
such that $C \ne C_0$. The belief function induced from m is not compatible with
the true probability.

Proof. It is easy to see that the belief function satisfies $Bel(C_0) = 1$.
Nevertheless, since $C_0$ does not include $x_0$, we have that the true probability
for $C_0$ is

\[ P(C_0) = 1 - \frac{1}{|CandidateSet(y)|}. \]

Therefore, as $P(C_0) < Bel(C_0)$, the belief function is not compatible with the
probability. □

It is important to note that this result removes from the set of valid
re-identification methods expressing uncertainty those that miss the correct
records from the candidate set.

4.1 Pignistic transformation and re-identification methods



In the previous section we defined a general re-identification method that
returns a belief function, and this belief function is required to be compatible
with the true probability. However, as re-identification methods in real
applications return probabilities, we reconsider our definition of
re-identification algorithms so that they also return probability distributions.
These probability distributions are nevertheless required to proceed from the
belief function.

In particular, the probability is constructed from the belief function following 
the principle of insufficient reason (or principle of indifference). That is, the 
assignment m to a set is distributed to the singletons of this set according to 
a uniform distribution. We say that a probability constructed in this way is 
compatible with the original distribution. 

This construction precisely corresponds to the pignistic transformation and
follows the transferable belief model by Smets. Details of a characterization of
the transformation are given in [12]. This pignistic transformation was defined in
Definition 4.
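The uniform redistribution of mass described above is short to write down. The following is a minimal sketch of the pignistic transformation for an assignment stored as a dictionary from focal sets to masses; the example assignment is hypothetical.

```python
def pignistic(m):
    """Spread each mass m(B) uniformly over the elements of B."""
    p = {}
    for B, mass in m.items():
        for x in B:
            p[x] = p.get(x, 0.0) + mass / len(B)
    return p

# Hypothetical assignment: some mass committed to record a,
# the rest left on the whole set {a, b, c}.
m = {frozenset({"a"}): 0.4, frozenset({"a", "b", "c"}): 0.6}
p = pignistic(m)
```

Here record a collects its own mass plus a third of the uncommitted mass, while b and c only receive their share of the latter, and the resulting values sum to one.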

Definition 14. Given two probabilities P and P', we say that P' is compatible
with P if there exists a belief function Bel compatible with P such that P' is the
pignistic probability distribution derived from Bel (i.e., $P' = P_{Bel}$).

We now present the pignistic probabilities for the running example.

Example 15. The pignistic probabilities for the belief functions of Example 11
for $y_j$ are as follows:

\[ P(y_j, a)[i] = \begin{cases} \frac{1}{|CandidateSet_{Attrs(j)}(y)|} & \text{if } x_i \in CandidateSet_{Attrs(j)}(y) \\ 0 & \text{otherwise.} \end{cases} \]

The probabilities defined in this example satisfy the following inequalities:

If $x_i \in CandidateSet(y)$, then $x_i \in CandidateSet_{Attrs(j)}(y)$ for all j.
Then, we have

\[ P(y_j, a)[i] = \frac{1}{|CandidateSet_{Attrs(j)}(y)|} \le P_{p,X,Y}(x_i | y) = \frac{1}{|CandidateSet(y)|}. \]

If $x_i \notin CandidateSet(y)$ and $x_i \in CandidateSet_{Attrs(j)}(y)$, then we have

\[ P(y_j, a)[i] = \frac{1}{|CandidateSet_{Attrs(j)}(y)|} > P_{p,X,Y}(x_i | y) = 0. \]

If $x_i \notin CandidateSet(y)$ and $x_i \notin CandidateSet_{Attrs(j)}(y)$, then we have

\[ P(y_j, a)[i] = P_{p,X,Y}(x_i | y) = 0. \]

Therefore, in general, for a given $x_i$ the pignistic probability can be either
larger or smaller than the true probability. However, it cannot assign a zero
probability to an $x_i$ in the candidate set.

In fact, the next proposition proves that the support of the probability of
the re-identification method should contain the support of the true probability.

Proposition 16. Let P be a true probability, let Bel be a belief function
compatible with P, let P' be the pignistic probability derived from Bel. Let $B_P$ be
the support of P (i.e., $B_P = \{x \mid P(x) \ne 0\}$), and let $B_{P'}$ be the support of P'.
Then, $B_P \subseteq B_{P'}$.

Proof. Let $x_0$ be an arbitrary element of the support of P, so $P(x_0) \ne 0$. Let
$\alpha = 1 - P(x_0)$. Then, taking into account that Bel is compatible with P, we
have

\[ 1 > \alpha = 1 - P(x_0) = P(X \setminus \{x_0\}) \ge Bel(X \setminus \{x_0\}) = \sum_{C \subseteq X \setminus \{x_0\}} m(C). \]

So,

\[ 1 - \sum_{C \subseteq X \setminus \{x_0\}} m(C) > 0. \]

As $\sum_{C \subseteq X} m(C) = 1$, we have

\[ \sum_{C \subseteq X : x_0 \in C} m(C) = 1 - \sum_{C \subseteq X \setminus \{x_0\}} m(C) > 0. \]

So, there exists at least one C such that $x_0 \in C$ and $m(C) \ne 0$.

Then, by definition of the pignistic transformation, for all $x \in C$, $P'(x) \ne 0$.
Therefore, $x_0$ is an element of the support of P'. As $x_0$ is an arbitrary element
of the support of P, the statement is proven. □

Now we will discuss two properties of the probability of re-identification that
concern the case in which the probability is one for a single record. That is, we
have that the true probability is a Dirac delta distribution at a single record $x_0$.
This distribution is denoted by $\delta(x_0)$ and its value is 1 if and only if $x = x_0$.
Note that this case is possible in Example 8 when the intersection of the candidate
sets of two (or more) variables is a singleton. Formally, $|CandidateSet_{V_i}(y) \cap
CandidateSet_{V_j}(y)| = 1$. A similar situation was exploited in [9] to attack rank
swapping.

Lemma 17. Let P be a Dirac delta distribution at $x_0$. Let Bel be a belief
function compatible with P. Then, $m(A) = 0$ if and only if $A \cap \{x_0\} = \emptyset$.

Proof. Suppose that $A \cap \{x_0\} = \emptyset$, then we have $P(A) = 0$.
As $P(A) \ge Bel(A)$, then we have $Bel(A) = 0$.

Therefore, as $Bel(A) = 0 = \sum_{B \subseteq A} m(B)$ and $m(B) \ge 0$ for all $B \subseteq X$, we
have $m(A) = 0$.

The fact that $m(A) = 0$ implies $A \cap \{x_0\} = \emptyset$ is a corollary of Proposition 16.

□

Proposition 18. Let $P(i) = r(y, a)$ be a Dirac delta distribution at $i_0$. Let
P' be a probability compatible with P. Then, if P' has its maximum in i' (i.e.,
$i' = \arg\max_i P'(i)$), then $i' = i_0$.

Proof. Suppose that $P(i) = 0$ for all $i \ne i_0$, and $P(i_0) = 1$. Lemma 17 implies
that $m(A) = 0$ for all A such that $A \cap \{i_0\} = \emptyset$. Let $\mathcal{A} = \{A \mid m(A) \ne 0\}$,
then for any belief function compatible with P,

\[ P_{Bel}(i_0) = \sum_{A \in \mathcal{A}} \frac{m(A)}{|A|} \ge \sum_{A \in \mathcal{A} : i_1 \in A} \frac{m(A)}{|A|} = P_{Bel}(i_1) \]

for all $i_1 \ne i_0$.

If P' is compatible with P, then there exists a belief function Bel compatible
with P such that $P' = P_{Bel}$. □

Note that this proposition is valid only when the true probability is a Dirac
delta distribution. However, this should not usually be the case if the data
protection algorithm is effective. For example, in Example 8 there may be $y_i$
with $|CandidateSet(y_i)| = 1$, but for other $y_j$ we may have
$|CandidateSet(y_j)| = k > 1$, so that the probability is $1/k < 1$. In general, we
might even have that the record with a maximal probability is not one of the
records in the candidate set. The next proposition establishes this fact.

Proposition 19. Let X be a reference set with \X\ > 3, let A C X be a set 

of k > 2 records with a true probability for record y equal to l/k. Then it is 
possible to have a probability compatible with the true probability such that the 
record with maximum probability is none of the ones in A. 

Proof. Let P represent the true probability. Then P{x) = 1/fc for all x G A. 

Consider a record xq not in A. Therefore P{xo) = 0. Then, define a belief 
function Bel in terms of m as follows: 



m(C) 



■i if C = {.To, x} for any x ^ A 
otherwise 



First we prove that this belief function is compatible with P. To do so, we 
need to prove that P{B) > Bel{B) for all B C X. To do so, we consider two 
cases for the sets B C X according to the membership of xq to B. Note that in 
both cases we have P(P) ^ \B D A\ ■ (1/fc). 

• Case .To e B: As Bel{B) ^ \B D A\ ■ (1/fc), we have that P(P) = Bel{B). 

• Case To ^ B: As Bel{B) = 0, we have that P(P) > Bel{B) = 0. 

Now we consider the pignistic probability from $Bel$. It is easy to prove that

$$P_{Bel}(x) = \begin{cases} 1/(2k) & \text{if } x \in A \\ 1/2 & \text{if } x = x_0 \\ 0 & \text{otherwise} \end{cases}$$

Therefore, we have that $x_0$ is the record with maximum pignistic probability,
although $x_0$ is precisely not in $A$. □
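
The construction used in this proof can be checked numerically. The following Python sketch is only an illustration under stated assumptions: it fixes $k = 3$, labels the records `x0`, ..., `x4` (where `x4` is a hypothetical extra record outside $A$), and uses helper names (`pignistic`, `belief`) that do not come from the paper. It builds $m$ with mass $1/k$ on each pair $\{x_0, x\}$, checks $Bel(B) \leq P(B)$ for every subset $B$, and reports the record with the largest pignistic probability.

```python
from fractions import Fraction
from itertools import combinations


def pignistic(m, universe):
    """Pignistic probability: each focal set splits its mass equally among its elements."""
    p = {x: Fraction(0) for x in universe}
    for focal, mass in m.items():
        for x in focal:
            p[x] += mass / len(focal)
    return p


def belief(m, B):
    """Bel(B): total mass of the focal sets that are contained in B."""
    return sum((mass for focal, mass in m.items() if focal <= B), Fraction(0))


k = 3                                        # assumed size of A
A = [f"x{i}" for i in range(1, k + 1)]       # records with true probability 1/k
universe = ["x0"] + A + ["x4"]               # x4: hypothetical record outside A

# True probability P: uniform on A, zero elsewhere.
P = {x: Fraction(0) for x in universe}
for x in A:
    P[x] = Fraction(1, k)

# Mass assignment of the proof: m({x0, x}) = 1/k for every x in A.
m = {frozenset({"x0", x}): Fraction(1, k) for x in A}

# Compatibility: Bel(B) <= P(B) for every subset B of the universe.
compatible = all(
    belief(m, frozenset(B)) <= sum(P[x] for x in B)
    for r in range(len(universe) + 1)
    for B in combinations(universe, r)
)

p_bel = pignistic(m, universe)
best = max(p_bel, key=p_bel.get)

print("compatible:", compatible)
print("pignistic:", {x: str(v) for x, v in p_bel.items()})
print("record with maximum pignistic probability:", best)
```

Exact fractions are used so that the comparison $Bel(B) \leq P(B)$ is not affected by rounding; the exhaustive subset check is only practical for small reference sets such as this one.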
This proposition implies the following corollary.

Corollary 20. Given a compatible probability $P'$ of a true probability $P$, the
record with maximal value in $P'$ can be different from the record with maximal
value in $P$.

The next example illustrates a masking method that leads to a belief function
and a probability distribution like the ones in Proposition 19 above.

Example 21. Let $p$ be a masking method defined as follows for records $x$ in
$\mathbb{N}^3$.

1. Let $\alpha$ be a random number in $\{0, 1\}$ according to a uniform distribution.

2. Let $\beta$ be a random number in $\{1, 2, 3\}$ according to a uniform distribution.

3. Let $y := x + \alpha e_\beta$, where $e_\beta$ is the unit vector in $\mathbb{N}^3$.

Given $A = \{x_0 = (000), x_1 = (100), x_2 = (010), x_3 = (001)\}$ and
$X \supseteq A$, we can model the re-identification of $y = (000)$ by means of
the belief function in Proposition 19. Therefore, when we guess by selecting the
most probable record using the pignistic transformation, we select $x_0$.
However, if $\alpha$ is known to be zero, $x_0$ is impossible.

The results given in this section describe the behaviour of our formalization 
for re-identification. At the same time, they give constraints on what we consider 
to be a proper re-identification algorithm, and, thus, they define the minimal 
requirements for these algorithms. 

4.2 Evidence and uncertainty measures 

When new information is given to the re-identification algorithm, the belief
function is updated according to this new evidence. A particular case is
when mass is transferred to a set $C_1$ from a larger set $C_2$. That is, we
increase the mass of $C_1$ while reducing that of $C_2$, leaving the masses of
all other sets unchanged.

The literature presents several definitions of uncertainty measures to evaluate 
either belief functions or probability distributions. Klir and Wierman [7] give 
an account and a classification of some of these measures. 

In this section, we first show with examples that the entropy of the pignistic
probability is not monotonic with respect to such transfers. We give these
examples because entropy is often interpreted as a measure of information and,
as such, one might think that in our case $C_1$ is more informative than $C_2$.
As the examples show, in some cases the entropy decreases under this type of
transformation, but in other cases it increases.

Later, we prove that the measure of nonspecificity is monotonic with respect
to the changes caused by transferring evidence from $C_2$ to $C_1$. This measure
was defined by Dubois and Prade [3] as a generalization of the measure by
Higashi and Klir [6]. A characterization of this measure was given by Ramer in
[10] (see Klir and Wierman [7] for details). We consider that this measure
better represents the quantity of information in a belief function.

The section finishes with a discussion on the uncertainty measures for re- 
identification. 

Definition 22. The entropy of a belief function $Bel$ is defined as the entropy
of the pignistic probability distribution derived from $Bel$. That is,

$$Entropy(Bel) = -\sum_{x \in X} P_{Bel}(x) \log P_{Bel}(x) \qquad (4)$$

Let us now consider two examples of the entropy of belief functions. 

Example 23. Let $X = \{x_1, \ldots, x_8\}$ and let $Bel$ be the belief function
defined by

• $m(\{x_1, \ldots, x_5\}) = 0.07692307 \cdot 5$

• $m(\{x_1, \ldots, x_8\}) = 0.07692307 \cdot 8$

The pignistic probability $P_{Bel}$ corresponds to:

• $P_{Bel}(x_1) = P_{Bel}(x_2) = P_{Bel}(x_3) = P_{Bel}(x_4) = P_{Bel}(x_5) = 0.15384614$

• $P_{Bel}(x_6) = P_{Bel}(x_7) = P_{Bel}(x_8) = 0.07692307$

Define $Bel'$ by transferring mass from $C_2 = \{x_1, \ldots, x_8\}$ to
$C_1 = \{x_1, x_2\}$. We have that $C_1 \subset C_2$ and, therefore, $C_1$ is
more specific than $C_2$. Let the transferred mass be
$\Delta = 0.038461544 \cdot 8$. Therefore, the new belief function is defined by:

• $m'(\{x_1, x_2\}) = m(\{x_1, x_2\}) + \Delta = m(\{x_1, x_2\}) + 0.038461544 \cdot 4 \cdot 2$

• $m'(\{x_1, \ldots, x_5\}) = m(\{x_1, \ldots, x_5\}) = 0.07692307 \cdot 5$

• $m'(\{x_1, \ldots, x_8\}) = 0.07692307 \cdot 8 - \Delta = 0.07692307 \cdot 8 - 0.038461544 \cdot 8 = 0.03846153 \cdot 8$

The pignistic probability $P_{Bel'}$ corresponds to:

• $P_{Bel'}(x_1) = P_{Bel'}(x_2) = 0.038461544 \cdot 4 + 0.07692307 + 0.03846153 = 0.2692308$

• $P_{Bel'}(x_3) = P_{Bel'}(x_4) = P_{Bel'}(x_5) = 0.07692307 + 0.03846153 = 0.1153846$

• $P_{Bel'}(x_6) = P_{Bel'}(x_7) = P_{Bel'}(x_8) = 0.03846153$

The entropy of $P_{Bel}$ is 2.0317593 and the entropy of $P_{Bel'}$ is 1.8300099.
So, in this case transferring mass/evidence from a larger set to a smaller one
reduces entropy.

Example 24. Let $X = \{x_1, \ldots, x_{10}\}$. Let $Bel$ be the belief function
defined by:

• $m(\{x_1, \ldots, x_{10}\}) = 0.08333332 \cdot 10$

• $m(\{x_1, x_2\}) = 0.08333332 \cdot 2$

The pignistic probability $P_{Bel}$ corresponds to:

• $P_{Bel}(x_1) = P_{Bel}(x_2) = 0.16666664$

• $P_{Bel}(x_3) = P_{Bel}(x_4) = \cdots = P_{Bel}(x_{10}) = 0.08333332$

Define $Bel'$ by transferring mass equivalent to $\Delta = 0.08333332 \cdot 10$
from $C_2 = \{x_1, \ldots, x_{10}\}$ to $C_1 = \{x_3, \ldots, x_{10}\}$. Then,
the new belief function $Bel'$ is defined by:

• $m'(\{x_1, \ldots, x_{10}\}) = m(\{x_1, \ldots, x_{10}\}) - \Delta = 0$

• $m'(\{x_3, \ldots, x_{10}\}) = m(\{x_3, \ldots, x_{10}\}) + \Delta = 0 + 0.08333332 \cdot 10 = 0.10416665 \cdot 8$

• $m'(\{x_1, x_2\}) = m(\{x_1, x_2\}) = 0.08333332 \cdot 2$

Therefore, its pignistic probability $P_{Bel'}$ corresponds to:

• $P_{Bel'}(x_1) = P_{Bel'}(x_2) = 0.08333332$

• $P_{Bel'}(x_3) = P_{Bel'}(x_4) = \cdots = P_{Bel'}(x_{10}) = 0.10416665$

Here, the entropy of $P_{Bel}$ is 2.2538579 while that of $P_{Bel'}$ is
2.2989538. So, the entropy of the pignistic distribution $P_{Bel'}$ of the
belief function with more information is larger than the entropy of the other
distribution $P_{Bel}$.
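
The entropies quoted in Examples 23 and 24 can be reproduced with a few lines of Python. This is a minimal sketch under assumptions that are not stated in the text: the constants 0.07692307 and 0.08333332 are read as 1/13 and 1/12, the logarithm is taken to be natural, records are represented by the integers 1..n, and the helper names (`pignistic`, `entropy`, `transfer`, `xs`) are purely illustrative.

```python
import math


def pignistic(m, universe):
    """Pignistic probability: each focal set splits its mass equally among its elements."""
    p = {x: 0.0 for x in universe}
    for focal, mass in m.items():
        for x in focal:
            p[x] += mass / len(focal)
    return p


def entropy(p):
    """Shannon entropy with natural logarithm (assumed base)."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def transfer(m, src, dst, delta):
    """Return a new mass assignment with `delta` moved from `src` to `dst`."""
    out = dict(m)
    out[src] = out.get(src, 0.0) - delta
    out[dst] = out.get(dst, 0.0) + delta
    return out


def xs(lo, hi):
    """The set {x_lo, ..., x_hi}, with records represented by their indices."""
    return frozenset(range(lo, hi + 1))


# Example 23: transfer 4/13 from {x1..x8} to {x1, x2}.
X8 = range(1, 9)
m23 = {xs(1, 5): 5 / 13, xs(1, 8): 8 / 13}
m23_new = transfer(m23, src=xs(1, 8), dst=xs(1, 2), delta=4 / 13)
h23 = entropy(pignistic(m23, X8))
h23_new = entropy(pignistic(m23_new, X8))

# Example 24: transfer 10/12 from {x1..x10} to {x3..x10}.
X10 = range(1, 11)
m24 = {xs(1, 10): 10 / 12, xs(1, 2): 2 / 12}
m24_new = transfer(m24, src=xs(1, 10), dst=xs(3, 10), delta=10 / 12)
h24 = entropy(pignistic(m24, X10))
h24_new = entropy(pignistic(m24_new, X10))

print(f"Example 23: {h23:.4f} -> {h23_new:.4f}")  # entropy decreases
print(f"Example 24: {h24:.4f} -> {h24_new:.4f}")  # entropy increases
```

Under these assumptions the computed values agree with the figures in the text to about four decimal places, which suggests that the natural logarithm is the base used for the reported entropies.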

The behaviour of the entropy in these two examples can be explained from
the fact that the entropy is a Schur-concave function (see e.g. [5] for details).
In the first example, $P_{Bel'}$ majorizes $P_{Bel}$ (it is the more
concentrated distribution), and therefore $entropy(P_{Bel}) > entropy(P_{Bel'})$.
In the second example, it is $P_{Bel}$ that majorizes $P_{Bel'}$, and thus
$entropy(P_{Bel'}) > entropy(P_{Bel})$.
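
The majorization relations can be checked directly. The sketch below assumes the usual definition, in which $p$ majorizes $q$ when both have the same total and every partial sum of $p$, sorted in decreasing order, is at least the corresponding partial sum of $q$ (so the more concentrated distribution is the majorizing one). The vectors are the pignistic distributions of Examples 23 and 24 recomputed as exact fractions from the mass assignments, reading 0.07692307 as 1/13 and 0.08333332 as 1/12; both this reading and the function name `majorizes` are assumptions made for illustration.

```python
from fractions import Fraction as F
from itertools import accumulate


def majorizes(p, q):
    """True if p majorizes q: equal totals and dominating decreasing partial sums."""
    ps = list(accumulate(sorted(p, reverse=True)))
    qs = list(accumulate(sorted(q, reverse=True)))
    return (
        len(ps) == len(qs)
        and ps[-1] == qs[-1]
        and all(a >= b for a, b in zip(ps, qs))
    )


# Example 23: pignistic distributions before and after the mass transfer.
p23 = [F(2, 13)] * 5 + [F(1, 13)] * 3
p23_new = [F(7, 26)] * 2 + [F(3, 26)] * 3 + [F(1, 26)] * 3

# Example 24: pignistic distributions before and after the mass transfer.
p24 = [F(1, 6)] * 2 + [F(1, 12)] * 8
p24_new = [F(1, 12)] * 2 + [F(5, 48)] * 8

print("Example 23, after majorizes before:", majorizes(p23_new, p23))
print("Example 24, before majorizes after:", majorizes(p24, p24_new))
```

Since a Schur-concave function cannot increase when moving to a majorizing distribution, these two relations are consistent with the entropy decreasing in Example 23 and increasing in Example 24.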

We prove now that the measure of nonspecificity is monotonic with respect 
to a mass transfer. First we introduce this measure. 

Definition 25. The measure of nonspecificity $N$ for a belief function $Bel$
is defined by

$$N(Bel) = \sum_{A \subseteq X} m(A) \log |A| \qquad (5)$$

For this measure, the following holds.

Proposition 26. Let $C_1$, $C_2$ be two subsets of $X$ such that
$C_1 \subset C_2$, let $Bel$ be a belief function defined by $m$ and $Bel'$ a
belief function defined by the following $m'$:

• $m'(C_1) = m(C_1) + \Delta$

• $m'(C_2) = m(C_2) - \Delta$

• $m'(A) = m(A)$ for all $A \neq C_1$ and $A \neq C_2$

where $\Delta$ is a value such that $m'(A) \in [0, 1]$ for all $A \subseteq X$.

Then, the nonspecificity of $Bel$ is larger than the nonspecificity of $Bel'$,
that is

$$N(Bel) > N(Bel').$$

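Proof. To prove this proposition, let us consider the nonspecificity of $Bel'$
and put it in terms of the nonspecificity of $Bel$:

\begin{align*}
N(Bel') &= \sum_{A \subseteq X} m'(A) \log |A| \\
&= \sum_{A \subseteq X,\, A \notin \{C_1, C_2\}} m'(A) \log |A| + m'(C_1) \log |C_1| + m'(C_2) \log |C_2| \\
&= \sum_{A \subseteq X,\, A \notin \{C_1, C_2\}} m(A) \log |A| + (m(C_1) + \Delta) \log |C_1| + (m(C_2) - \Delta) \log |C_2| \tag{6} \\
&= \sum_{A \subseteq X,\, A \notin \{C_1, C_2\}} m(A) \log |A| + m(C_1) \log |C_1| + m(C_2) \log |C_2| + \Delta \log |C_1| - \Delta \log |C_2| \\
&= N(Bel) + \Delta (\log |C_1| - \log |C_2|)
\end{align*}

As $|C_1| < |C_2|$ we have that $\log |C_1| - \log |C_2| < 0$, so that
$N(Bel) > N(Bel')$.

□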
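
The identity obtained at the end of the proof, $N(Bel') = N(Bel) + \Delta(\log|C_1| - \log|C_2|)$, can be verified numerically on the mass transfers of Examples 23 and 24. The sketch below is illustrative only: it assumes the natural logarithm, reads the constants of the examples as 1/13 and 1/12, represents records by the integers 1..n, and uses hypothetical helper names (`nonspecificity`, `transfer`, `xs`).

```python
import math


def nonspecificity(m):
    """N(Bel): sum over focal sets of m(A) * log|A| (natural log assumed)."""
    return sum(mass * math.log(len(focal)) for focal, mass in m.items())


def transfer(m, c2, c1, delta):
    """Move `delta` units of mass from the set c2 to its proper subset c1."""
    assert c1 < c2, "c1 must be a proper subset of c2"
    out = dict(m)
    out[c2] = out.get(c2, 0.0) - delta
    out[c1] = out.get(c1, 0.0) + delta
    return out


def xs(lo, hi):
    """The set {x_lo, ..., x_hi}, with records represented by their indices."""
    return frozenset(range(lo, hi + 1))


# (mass assignment, C2, C1, Delta) for each example.
cases = {
    "Example 23": ({xs(1, 5): 5 / 13, xs(1, 8): 8 / 13}, xs(1, 8), xs(1, 2), 4 / 13),
    "Example 24": ({xs(1, 10): 10 / 12, xs(1, 2): 2 / 12}, xs(1, 10), xs(3, 10), 10 / 12),
}

results = {}
for name, (m, c2, c1, delta) in cases.items():
    before = nonspecificity(m)
    after = nonspecificity(transfer(m, c2, c1, delta))
    predicted = before + delta * (math.log(len(c1)) - math.log(len(c2)))
    results[name] = (before, after, predicted)
    print(f"{name}: N(Bel) = {before:.4f}, N(Bel') = {after:.4f}, predicted = {predicted:.4f}")
```

In both cases the nonspecificity decreases, including Example 24, where the entropy of the pignistic probability increases.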

We have seen in this section that nonspecificity is useful to measure the
information in a belief function, and that entropy is not. The failure of
entropy in this context might seem to contradict the fact that entropy is
typically understood as a measure of information. Nevertheless, the
classification of uncertainty measures given in Klir and Wierman [7] sheds some
light on this issue. Entropy is classified as a measure of conflict. In this
sense, the belief function $m'$ in Example 24, obtained when mass is transferred
from $C_2$ to $C_1$, presents a larger conflict for a decision than $m$. This is
so because the probabilities are much more similar in $m'$ than in $m$, although
we have more information in $m'$ than in $m$. In contrast, Klir and Wierman
classify nonspecificity as a measure of imprecision, which is said to be
connected with sizes (cardinalities) of relevant sets of alternatives (see p. 43
in [7]), which is precisely the case here. Proposition 26 clearly shows that any
transfer of mass from a set to a more specific one will always decrease the
measure. Indeed, we have that $N(Bel) = 0$ when $Bel$ is a probability
distribution.
4.3 Conditioning

When additional information is given to the re-identification algorithm, the
belief function is expected to change accordingly. In general, we can presume
that this information can also be represented in terms of a belief function.
Note that conditioning in probability theory can be expressed by a belief
function. For example, conditioning on the presence of an element in a set $B$,
which in probability theory results in the probability $P(A|B)$, is expressed by
conditioning with respect to the belief function generated from $m_B(B) = 1$ and
$m_B(A) = 0$ for all $A \neq B$. Naturally, the belief function used in the
conditioning should also be compatible with the true probability.

Therefore, given two belief functions compatible with the true probability, 
the conditioning should lead to another belief function, also compatible with 
the true probability. 

Definition 27. Given two belief functions $Bel_1$ and $Bel_2$ compatible with a
true probability $P$, an acceptable combination function $C$ is a combination
function that returns a new belief function $Bel$ that is compatible with $P$
and such that $N(Bel) \leq \min(N(Bel_1), N(Bel_2))$.

Any combination function that satisfies this property will be suitable for
conditioning. See e.g. [17, 15] for functions satisfying this property.

Definition 28. Given $r$ belief functions $Bel_1, \ldots, Bel_r$ compatible with
a true probability $P$, we define their combination $C$ as the extension of the
acceptable combination function $C$ as follows:

$$C(Bel_1, \ldots, Bel_r) = C(C(Bel_1, \ldots, Bel_{r-1}), Bel_r)$$



Then, when different items $it_1, \ldots, it_k$ of additional information are
considered, all of them expressed by means of belief functions
$Bel_{it_1}, \ldots, Bel_{it_k}$ that are compatible with the true probability,
and we combine them, the result will converge to the true probability, and the
nonspecificity is reduced. When the true probability is achieved, the
nonspecificity is zero.

Proposition 29. Let $it_1, \ldots, it_k$ be a set of items of additional
information expressed by means of belief functions
$Bel_{it_1}, \ldots, Bel_{it_k}$ compatible with a true probability $P$. Let $C$
be an acceptable combination function.

Then, the combination of belief functions $Bel_{it_1}, \ldots, Bel_{it_r}$ for
$r < k$ using $C$ as in Definition 28 is a belief function $Bel_r$ compatible
with the true probability $P$, and such that $N(Bel_r) \geq N(Bel_{r+1})$.

In addition, if, for a given $r_0$, the belief function $Bel_{r_0}$ is a
probability, then $Bel_{r_0} = P$ and for all $r > r_0$ we have
$Bel_{r_0} = Bel_r$.

The proof of this proposition is trivial taking into account that $C$ is an
acceptable combination function and that $N(Bel) = 0$ when $Bel$ is a
probability distribution.
5 Conclusions

In this paper we have formalized re-identification algorithms in terms of belief
functions and probabilities. We have shown that belief functions and their
pignistic transformation permit us to express the uncertainty in
re-identification algorithms in a natural way.